Introduction to Open Data Science - course project

RStudio, GitHub & Open data sources

Textbook: Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences, Second Edition. Chapman and Hall/CRC, Boca Raton, Florida, USA.

My GitHub repository: https://github.com/hkallo/IODS-project

My GitHub course diary: https://hkallo.github.io/IODS-project/

# Date of submission
date()
## [1] "Sun Nov 26 17:03:52 2023"

Assignment 1

How are you feeling right now? Feeling good, looking forward to learning new things in this course. R for Health Data Science is truly great; already after reading the first sections, I have learnt several new things that will make my scripts better and more concise. The book has very nice explanations for everything.

What do you expect to learn? I already have some experience with R statistical software and the basics of statistical analysis. I expect to learn about open data sources and how to utilize them, as well as about GitHub, which is a completely new thing to me. It is also nice to get a recap of the basics of statistics and perhaps find new ways to execute things in R.

Where did you hear about the course? I saw an email.

Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? As I already have some knowledge of R, it was pretty easy and straightforward. However, if I had no experience with this language, I don’t think I would have learnt too much, as it is so detached from actual script writing and nothing really had to be done by yourself. The book seems like a very nice tool for checking how to execute things in R, though it works better together with the assignments!

Which were your favorite topics? Which topics were most difficult? My favorite topics were the pipe (%>%) and several parts of the Summarizing data chapter. These helped me to understand functions that I have already used, but apparently without understanding them perfectly :) It will be easier to use them in the future. I have also tried R Markdown before, but it will be nice to learn more about it. I did not know about the cheat sheets in R; that was a great tip!

Some other comments on the book and our new approach of getting started with R Markdown etc.? I guess during the upcoming weeks we will get to do the coding ourselves, so no other comments :)


Assignment 2: Regression and model validation

Name: Henna Kallo

Date: 13.11.2023

In this exercise we learn to perform data wrangling and linear regression analysis!

#Assignment submitted
date()
## [1] "Sun Nov 26 17:03:52 2023"

Data wrangling: preparing dataset for analysis

library(tidyverse); library(dplyr); library(ggplot2); library(GGally) #load libraries

data <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t",  header = TRUE) #data upload

str(data) #structure of the object
dim(data) #dimensions of the data

Next we will create a new dataset for analysis containing only the needed information: gender, age, attitude, deep, strategic and surface level question scores, and points.

#save the variables ready in the original data frame to the new analysis data frame
learningAnalysis <- data %>%
  select(gender, Age, Attitude, Points)

#find combination variables (deep, strategic and surface level questions) and save each on their own variable
deep <- data %>%
  select(starts_with("D")) #NB! This selects too many columns (everything starting with "D")! The next line keeps only the 12 deep question columns.
deep <- deep[,1:12]

surf <- data %>%
  select(starts_with("SU"))

stra <- data %>%
  select(starts_with("ST"))

#averaging the answers of selected questions and saving them to the analysis dataset
learningAnalysis$deep <- rowMeans(deep)
learningAnalysis$stra <- rowMeans(stra)
learningAnalysis$surf <- rowMeans(surf)

#scale the combination variable Attitude back to the 1-5 scale by dividing by 10, and delete the old variable
learningAnalysis$attitude <- learningAnalysis$Attitude / 10
learningAnalysis <- subset(learningAnalysis, select = -Attitude)

colnames(learningAnalysis) #check the column names and convert if needed
colnames(learningAnalysis)[2] <- "age"
colnames(learningAnalysis)[3] <- "points"

learningAnalysis <- filter(learningAnalysis, points>0) #include only the rows where points > 0

#reorder the columns:
learningAnalysis <- learningAnalysis %>%
  select(gender, age, attitude, deep, stra, surf, points)

#last check
str(learningAnalysis)
head(learningAnalysis)

Now we have the data frame prepared for the subsequent analysis. Let’s save the file to the IODS Project folder.

setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory to the IODS project folder
getwd() #check that it worked

write_csv(learningAnalysis, file= "learning2014.csv") #save file as csv

Data analysis: explore, analyze & interpret

Summary of the dataset

setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory

learningAnalysis <- read.csv("learning2014.csv", header=TRUE) #read file into R

str(learningAnalysis) #structure of the data frame
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learningAnalysis) #first rows of the data frame
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
summary(learningAnalysis) #summary of the variables
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Data description:

  • Research question: Do students’ gender, age, attitude and question scores affect success in course completion (points gained)? The learning approaches are measured with questions on several levels (deep, strategic, surface). A set of questions is used to quantify attitude on a scale of 1-5. Points represent the points that the students have gained from the course. The data also includes information on the students’ age and gender.

  • Data frame structure: There are 166 observations of each variable. There are 7 variables in total: gender (character), age (integer), attitude (numeric), deep, strategic and surface level question scores (numeric), and points (integer).

  • The summary table above describes summaries of variables: min, max, mean, median and 1st and 3rd quartiles.

    • There are 110 females and 56 men in this study.

    • The students are aged from 17 up to 55.

    • Attitude scores range from 1.4 to 5.0, the average being 3.14.

    • The mean scores of deep, strategic and surface questions are 3.68, 3.12 and 2.79, respectively.

    • The minimum points gained is 7, the maximum 33. The average is 22.72, and 50% of all scores fall within the range 19.00-27.75 points.

Linear regression analysis: Is there an association/dependency between points (dependent variable) and the age/attitude/deep/stra/surf (explanatory) variables?

To test this, we will perform regression analysis, which is a statistical method that describes the relationship between two or more variables.
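To illustrate the workflow on a small, self-contained example first (using R’s built-in mtcars data rather than the course data, so this is only a sketch):

```r
# Simple linear regression sketch: model fuel consumption (mpg)
# as a function of car weight (wt) in R's built-in mtcars data.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                # intercept and slope; slope ~ -5.34 mpg per 1000 lbs
summary(fit)$r.squared   # proportion of variance in mpg explained by wt
```

The same lm() + summary() pattern is used below on the learning2014 data, just with more explanatory variables.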

Graphical overview of the data:

#Function for adding correlation panel
panel.cor <- function(x, y, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y), digits=2)
  txt <- paste0("R = ", r)
  cex.cor <- 0.8/strwidth(txt) #scaling factor, in case we want text size to depend on panel width
  text(0.5, 0.5, txt, cex = 1)} #use cex = cex.cor to apply the scaling

#Function for adding regression line
upper_panel_regression_line = function(x,y, ...){
  points(x,y,...)
  linear_regression = lm(y~x)
  linear_regression_line = abline(linear_regression)}

my_cols <- c("#00AFBB", "#E7B800") #set colors

learningAnalysis$gender<-as.factor(learningAnalysis$gender) #for being able to change color, convert gender to factor type

#Visualization with a scatter plot + regression line matrix, add R values to the lower side of the matrix.
pairs(learningAnalysis[-1], col = my_cols[learningAnalysis$gender],
      lower.panel = panel.cor , upper.panel = upper_panel_regression_line)

The scatter plots show the distribution of the observations (females turquoise, males yellow). The regression lines give some indication of whether there is an association between two variables. For instance, there seems to be a positive dependency between attitude and points: the line goes upwards and is steeper than any other line. Also, the R-value (correlation coefficient) for points vs. attitude is 0.44, suggesting a positive correlation between these variables. If the slope is (close to) horizontal and R is (close to) zero, there is no association between the variables; this is the case, for example, between the deep and points variables. Then again, if the R-value is negative and the slope of the regression line is negative (the line goes downhill), as between the surf and points variables, the association is negative: when the surface question score gets a higher value, the points are likely lower. Thus, with this kind of modelling we can also predict success in the course based on the answers to the preliminary questions. A better explanation of R-values and their meaning comes after the next figure.

Let’s check if analysis executed separately in male and female reveals differences in association of the variables.

# create a more advanced plot matrix with ggpairs()
ggpairs(learningAnalysis, mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20)))

Let’s go through the figure. Red = women, Blue = men.

There are about twice as many women as men in this study.

The boxplots reveal that the median age of women is a bit lower than that of men. The median attitude score of women is also a bit lower, whereas the strategic and surface question scores are slightly higher in women.

The density plots reveal the same, and also show that the distributions of the attitude and surface question scores are quite different in males and females. The density plot of age reveals that the data is right-skewed: there are more young participants than older ones. Age, stra and surf have clearly unimodal distributions (only one peak), whereas the other variables have 1-2 peaks, sometimes depending on the gender. The scatter plots and histograms also show the distributions of the values, but the clearest way to draw conclusions about the distributions is still the density plot.

As we are focused on studying the causal relationship between the dependent variable ‘points’ and the explanatory variables age, attitude, deep, stra and surf, let’s focus on the rightmost column of correlation coefficient values (the correlation coefficient measures the linear correlation between two sets of data). Mukaka (2012; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/) presents a rule of thumb for interpreting different coefficient values. The strongest association is between attitude and points; this association is strong in both genders separately as well. Interestingly, age seems to have a stronger association with points in men than in women. This association is negative, meaning that older age is associated with lower scores. The next largest coefficients are for the stra and surf variables, but they are already quite low and not statistically significant.

Multiple regression model

Now we select 3 explanatory variables to explain the dependent variable “points”. The selection is based on the slopes of the regression lines and the R values in the figures above: the 3 highest absolute R values are attitude (R=0.44), stra (R=0.15) and surf (R=-0.14) (genders not separated in the analysis).

#Multiple regression analysis
# create a regression model with multiple explanatory variables
my_model<- lm(points ~ attitude + stra + surf, data = learningAnalysis)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learningAnalysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The p-value of the F-statistic is 3.156e-08, which is highly significant. This means that at least one of the predictor variables (attitude, stra, surf) is significantly related to the outcome variable.

To identify which predictor variables are significant, let’s examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values. Attitude is significantly associated with points, whereas the variables stra and surf show no association. Thus, we can remove these two variables from the model. The coefficient b for attitude is ~3.40, meaning that this is the average effect on our dependent variable (points) when the predictor (attitude) increases by one unit and all the other predictors stay unchanged.

my_model_attitude<- lm(points ~ attitude, data = learningAnalysis)

# print out a summary of the model
summary(my_model_attitude)
## 
## Call:
## lm(formula = points ~ attitude, data = learningAnalysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The p-value of the F-statistic is significant (4.119e-09), and the model tells us that when attitude grows by 1 unit, points increase by about 3.5 on average. The final equation would be: points = 11.6 + 3.5 * attitude.
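A quick sanity check of this equation: plugging in a few attitude values gives the predicted points directly (the helper function below is only an illustration, using the rounded coefficients from the summary above).

```r
# Hypothetical helper using the rounded coefficients from the fitted model:
# points = 11.6 + 3.5 * attitude
predict_points <- function(attitude) 11.6 + 3.5 * attitude

predict_points(4)  # a student with attitude score 4 -> ~25.6 predicted points
predict_points(2)  # attitude score 2 -> ~18.6 predicted points
```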

Out of curiosity, before proceeding to quality assessment, let’s check whether age should be included in the model when modelling the points of male students only.

menstudents <- learningAnalysis %>%
  filter(gender=="M")

my_model_men<- lm(points ~ attitude + age, data = menstudents)

summary(my_model_men) #summary of the model
## 
## Call:
## lm(formula = points ~ attitude + age, data = menstudents)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6798  -3.2162   0.5482   2.8711   9.3838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.66207    4.56036   2.996 0.004156 ** 
## attitude     4.10182    1.10353   3.717 0.000487 ***
## age         -0.16050    0.08465  -1.896 0.063418 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.285 on 53 degrees of freedom
## Multiple R-squared:  0.2538, Adjusted R-squared:  0.2256 
## F-statistic: 9.013 on 2 and 53 DF,  p-value: 0.0004272

Interestingly, this analysis indicates that age might be negatively associated with points in men: the older the man, the fewer the points. The p-value of 0.06 is very close to statistical significance (p<0.05). However, let’s not continue further with this subset but rather analyse men and women together.

Quality assessment

Finally, we should perform a quality assessment of the model, based on R-squared (R2) and the Residual Standard Error (RSE). R2 is sensitive to the number of variables included in the model, so the adjusted R2 corrects for the number of explanatory variables in the prediction model. The adjusted R2 = 0.1856, meaning that ~19% of the variance in points can be predicted by the attitude score. If we compare this adjusted R2 value to the previous model with 3 predictor variables, the values are not very different, meaning that having 3 predictors in the model did not improve its quality.

#error rate
summary(my_model_attitude)$sigma/mean(learningAnalysis$points)
## [1] 0.2341752

The RSE estimate gives a measure of the prediction error: the lower the RSE, the more accurate the model. Here we calculated the error rate by dividing the RSE by the mean of the outcome variable. Thus, an RSE of 5.32 corresponds to a 23% error rate.
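The same arithmetic, spelled out with the rounded values from the model summary:

```r
# Error rate = RSE / mean(outcome), with values rounded from the output above
rse         <- 5.32   # residual standard error of my_model_attitude
mean_points <- 22.72  # mean of the points variable
rse / mean_points     # ~0.234, i.e. about a 23% error rate
```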

The last thing to do is to graphically explore the validity of our model assumptions with the Residuals vs Fitted values plot, the Normal QQ-plot and the Residuals vs Leverage plot. A residual is the difference between the observed value and the value that the model predicts for that observation.

# draw diagnostic plots using the plot() function
par(mfrow = c(2,2))
plot(my_model_attitude, which=c(1,2,5))
par(mfrow = c(1, 1)) #reset plotting matrix: 

  • Residuals vs Fitted values plot: detects unequal error variances, non-linearity, and outliers.
    • At the right end of the x axis there is some indication of heteroskedasticity: the spread of the residuals seems to decrease. However, the variance of the residuals otherwise seems fairly even, so I conclude that this is an acceptable amount of variation.
    • The horizontal band (red line) forms around the zero line. Thus, it is reasonable to assume a linear relationship.
    • There are 3 outliers. One option would be to further evaluate whether (some of) these three observations should be removed from the analysis.
  • Normal QQ-plot: provides an indication of the univariate normality of the dataset.
    • From the figure we observe that the normal probability plot of the residuals is approximately linear, supporting the condition that the error terms are normally distributed.
  • Residuals vs Leverage plot: identifies influential data points in the model.
    • Three data points are highlighted; one of them was already flagged as an outlier in the two previous plots. It might be a good idea to further study the influence of these observations on the slope of the regression line (https://rpubs.com/Amrabdelhamed611/669768). However, as the points are not flagged by Cook’s distance, they are most likely not too influential and can thus be included in the analysis.

Conclusion: The final course points of the students are positively associated with the attitude scores based on the preliminary questions asked before the course: the higher the attitude score, the more points. Only one explanatory variable is included in the model, as the rest did not reach significance. The quality assessment revealed that our model is reliable.


Assignment 3: Logistic regression

Name: Henna Kallo

Date: 18.11.2023

In this exercise we learn to perform data wrangling and logistic regression analysis!

We are exploring data from two questionnaires related to student performance in the secondary education of two Portuguese schools. The data include student grades and demographic, social and school-related variables. Here we have combined the data sets of the Mathematics and Portuguese language subjects.

Here we study the relationships between alcohol consumption with selected variables.

Data source: http://www.archive.ics.uci.edu/dataset/320/student+performance

#import data
library(readr)
alc<-read_csv("C:/Users/Henna/Desktop/IODS/IODS-project/Data/student_performance_alcohol.csv") 
alc<-as.data.frame(alc);

colnames(alc) #column names
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
str(alc) #structure of the dataset
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : num  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : num  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : num  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : num  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

All the variables are listed above. In the variables not used for joining the two data sets, the answers have been combined by averaging (including the grade variables). Alcohol use (‘alc_use’) is the average of workday and weekend alcohol consumption; if the average is higher than 2, alcohol consumption is considered ‘high use’ (‘high_use’ = TRUE). The rest of the variables are described on the website (link given above).

We have 370 observations of 35 variables in the data frame. There are character, numeric and logical type variables.

Next we select 4 interesting variables to further study their relationship with alcohol consumption. Let’s first visualize all the variables:

library(tidyr); library(ggplot2); library(dplyr);
gather(alc) %>% ggplot(aes(value))  + 
  facet_wrap("key", scales = "free") + 
  geom_bar(fill="#00AFBB") +
  ggtitle("Barplots of all variables")

Below are listed the 4 chosen variables, a brief description of each, and the expected relationship with alcohol consumption:

  • Goout
    • going out with friends, scoring 1-5
    • hypothesis: there is a positive relationship between ‘goout’ and ‘high_use’: higher value of the goout is linked with heavier alcohol consumption.
  • Absences
    • number of school absences
    • hypothesis: there is a positive relationship between ‘absences’ and ‘high_use’: higher value of absences is linked with heavier alcohol consumption.
  • Failures
    • number of past class failures
    • hypothesis: there is a positive relationship between ‘failures’ and ‘high_use’: higher value of failures is linked with heavier alcohol consumption.
  • Romantic
    • with a romantic relationship (binary: yes or no)
    • hypothesis: there might be an association between variables ‘romantic’ and ‘high_use’: single status linked with high_use=TRUE.

Next we will see whether we can find support for our hypotheses with numerical and graphical exploration of data:

Density or frequency plots (depending on the variable), as well as cross-tabulations and plots to visualize

ggplot(alc, aes(x=goout)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("GoOut distribution") +
  theme_classic()

# create cross-tabulations and plots to visualize
library(sjPlot)

tab_xtab(var.row = alc$goout, var.col = alc$high_use, title = "Cross-Tab of GoOut & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of GoOut & High alcohol consumption

goout   high_use = FALSE   high_use = TRUE   Total
1         19 (86.4 %)         3 (13.6 %)      22
2         82 (84.5 %)        15 (15.5 %)      97
3         97 (80.8 %)        23 (19.2 %)     120
4         40 (51.3 %)        38 (48.7 %)      78
5         21 (39.6 %)        32 (60.4 %)      53
Total    259 (70 %)         111 (30 %)       370

χ2=55.574 · df=4 · Cramer’s V=0.388 · p=0.000
plot_xtab(alc$goout, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The density plot shows that most of the students have a ‘goout’ score of 2-4. The cross-tabulation shows that the higher the ‘goout’ score, the larger the proportion of observations with high_use=TRUE. This indicates that our hypothesis was correct: a higher goout value is linked with heavier alcohol consumption.

ggplot(alc, aes(x=absences)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Absences distribution") +
  theme_classic()

tab_xtab(var.row = alc$absences, var.col = alc$high_use, title = "Cross-Tab of Absences & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of Absences & High alcohol consumption

absences   high_use = FALSE   high_use = TRUE   Total
0            50 (79.4 %)        13 (20.6 %)       63
1            37 (74 %)          13 (26 %)         50
2            40 (71.4 %)        16 (28.6 %)       56
3            31 (81.6 %)         7 (18.4 %)       38
4            24 (68.6 %)        11 (31.4 %)       35
5            16 (72.7 %)         6 (27.3 %)       22
6            16 (76.2 %)         5 (23.8 %)       21
7             9 (75 %)           3 (25 %)         12
8            14 (70 %)           6 (30 %)         20
9             5 (45.5 %)         6 (54.5 %)       11
10            5 (71.4 %)         2 (28.6 %)        7
11            2 (40 %)           3 (60 %)          5
12            3 (42.9 %)         4 (57.1 %)        7
13            1 (50 %)           1 (50 %)          2
14            1 (14.3 %)         6 (85.7 %)        7
16            0 (0 %)            1 (100 %)         1
17            0 (0 %)            1 (100 %)         1
18            1 (50 %)           1 (50 %)          2
19            0 (0 %)            1 (100 %)         1
20            2 (100 %)          0 (0 %)           2
21            1 (50 %)           1 (50 %)          2
26            0 (0 %)            1 (100 %)         1
27            0 (0 %)            1 (100 %)         1
29            0 (0 %)            1 (100 %)         1
44            0 (0 %)            1 (100 %)         1
45            1 (100 %)          0 (0 %)           1
Total       259 (70 %)         111 (30 %)        370

χ2=43.001 · df=25 · Cramer’s V=0.341 · Fisher’s p=0.008
plot_xtab(alc$absences, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The density plot shows that most of the students have an absence score of zero (the distribution is right-skewed). The cross-tabulation shows that the more absences there are, the larger the proportion of students with high alcohol consumption tends to be. This indicates that our hypothesis was correct: a higher number of absences is linked with heavier alcohol consumption.

ggplot(alc, aes(x=failures)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Failures distribution") +
  theme_classic()

tab_xtab(var.row = alc$failures, var.col = alc$high_use, title = "Cross-Tab of failures & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of failures & High alcohol consumption

failures   high_use = FALSE   high_use = TRUE   Total
0           238 (73.2 %)        87 (26.8 %)      325
1            12 (50 %)          12 (50 %)         24
2             8 (47.1 %)         9 (52.9 %)       17
3             1 (25 %)           3 (75 %)          4
Total       259 (70 %)         111 (30 %)        370

χ2=14.304 · df=3 · Cramer’s V=0.197 · Fisher’s p=0.002
plot_xtab(alc$failures, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The density plot of failures is also right-skewed, meaning that most of the students pass the class. The cross-tabulation shows that a higher number of failures is linked with a larger proportion of high alcohol consumption. This indicates that our hypothesis was correct: a higher number of failures is linked with an increase in heavy alcohol consumption.

ggplot(alc, aes(x=romantic)) +
  geom_bar(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Romantic relationship status frequencies") +
  theme_classic()

tab_xtab(var.row = alc$romantic, var.col = alc$high_use, title = "Cross-Tab of Romantic relationship status & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of Romantic relationship status & High alcohol consumption

romantic   high_use = FALSE   high_use = TRUE   Total
no          173 (68.9 %)        78 (31.1 %)      251
yes          86 (72.3 %)        33 (27.7 %)      119
Total       259 (70 %)         111 (30 %)       370

χ2=0.286 · df=1 · φ=0.034 · p=0.593
plot_xtab(alc$romantic, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The frequency table shows that about one third of the students are in a romantic relationship, whereas ~2/3 are single. The cross-tabulation indicates that even though there is a slightly larger percentage of single students with high alcohol use, this difference is most likely not statistically meaningful. Thus, this result does not support our hypothesis that ‘there might be an association between the variables romantic and high_use’.

We are interested in whether alcohol consumption has an association with the chosen set of student characteristics. Binary logistic regression can tell us the probability of this. We choose binary logistic regression because the outcome variable, ‘high_use’, has two levels (TRUE/FALSE); the explanatory variables can be of other types as well.

#binary logistic regression
m <- glm(high_use ~ failures + absences + goout + romantic, data = alc, family = "binomial")

summary(m) # a summary of the model
## 
## Call:
## glm(formula = high_use ~ failures + absences + goout + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0472  -0.7510  -0.5266   0.8886   2.3474  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.55115    0.44211  -8.032 9.57e-16 ***
## failures     0.56492    0.22385   2.524 0.011612 *  
## absences     0.07762    0.02243   3.461 0.000538 ***
## goout        0.70634    0.11846   5.963 2.48e-09 ***
## romanticyes -0.35224    0.27737  -1.270 0.204111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 382.41  on 365  degrees of freedom
## AIC: 392.41
## 
## Number of Fisher Scoring iterations: 4

Let’s go through the output:

First we have the distribution of the deviance residuals.

Next we get the coefficients, their standard errors, the z-statistic, and the associated p-values. Failures, absences and goout are statistically significant; the variable ‘romantic’ is not.

The logistic regression coefficients give the change in the log odds of the outcome (high_use) for a one-unit increase in the predictor variable:

  • for every one-unit change in failures, the log odds of high_use=TRUE increase by 0.56.
  • for every one-unit change in absences, the log odds of high_use=TRUE increase by 0.08.
  • for every one-unit change in goout, the log odds of high_use=TRUE increase by 0.71.

From these results we can construct the logistic regression equation (leaving out the statistically non-significant variable ‘romantic’):

log(odds[high_use=yes]) = -3.55115 + 0.56492 * failures + 0.07762 * absences + 0.70634 * goout
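A log-odds value from this equation can be turned back into a probability with the inverse logit. A small sketch in base R; the student profile here is hypothetical, chosen only for illustration:

```r
# Hypothetical student: 1 failure, 5 absences, goout = 4;
# coefficients copied from the model summary above
log_odds <- -3.55115 + 0.56492 * 1 + 0.07762 * 5 + 0.70634 * 4

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p <- plogis(log_odds)
round(p, 3)  # a bit over 0.5, i.e. high use is slightly more likely than not
```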

Next we will compute the odds ratios (OR) and their confidence intervals (CI):

coef(m) # the coefficients of the model
## (Intercept)    failures    absences       goout romanticyes 
## -3.55114961  0.56492346  0.07762078  0.70634191 -0.35223716
OR <- coef(m) %>% exp # compute odds ratios (OR)

CI <-  confint(m) %>% exp # compute confidence intervals (CI)

cbind(OR, CI) # print out the odds ratios with their confidence intervals
##                     OR      2.5 %     97.5 %
## (Intercept) 0.02869164 0.01165358 0.06618211
## failures    1.75931312 1.14014676 2.75402869
## absences    1.08071276 1.03551949 1.13208308
## goout       2.02656433 1.61539789 2.57288191
## romanticyes 0.70311335 0.40369796 1.20112415

We can conclude the following:

  • for a one unit increase in failures, the odds of high alcohol consumption increase by a factor of 1.76 (the outcome is 76% more likely).
  • for a one unit increase in absences, the odds of high alcohol consumption increase by a factor of 1.08 (the outcome is 8% more likely).
  • for a one unit increase in goout, the odds of high alcohol consumption increase by a factor of 2.03 (the odds of the outcome roughly double).

The CI estimates the precision of the OR: a wide CI indicates low precision, whereas a narrow CI indicates higher precision.
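A handy consequence: an OR differs significantly from 1 (at roughly the 5% level) exactly when its 95% CI does not contain 1. A small check using the intervals printed above (rounded to three decimals):

```r
# 95% CIs of the odds ratios, copied from the output above
ci <- rbind(failures    = c(1.140, 2.754),
            absences    = c(1.036, 1.132),
            goout       = c(1.615, 2.573),
            romanticyes = c(0.404, 1.201))

# TRUE when the whole interval lies above or below 1
excludes_one <- ci[, 1] > 1 | ci[, 2] < 1
excludes_one  # only romanticyes is FALSE, matching the p-values
```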

These results support our hypotheses about the effects of failures, absences and goout, but not about the effect of romantic relationship status.

Next, we will explore the predictive power of the model. We will include only the statistically significant variables: failures, absences & goout.

m_pred <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")

probabilities <- predict(m_pred, type = "response") # predict the probability of high_use

alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'

alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, goout, high_use, probability, prediction) %>% tail(10)
##     failures absences goout high_use probability prediction
## 361        0        3     3    FALSE  0.21435314      FALSE
## 362        1        0     2    FALSE  0.15791281      FALSE
## 363        1        7     3     TRUE  0.38937775      FALSE
## 364        0        1     3    FALSE  0.19036196      FALSE
## 365        0        6     3    FALSE  0.25431749      FALSE
## 366        1        2     2    FALSE  0.17871716      FALSE
## 367        0        2     4    FALSE  0.33847888      FALSE
## 368        0        3     1    FALSE  0.06266341      FALSE
## 369        0        4     5     TRUE  0.54534711       TRUE
## 370        0        2     1     TRUE  0.05843362      FALSE
table(high_use = alc$high_use, prediction = alc$prediction) # tabulate the target variable versus the predictions
##         prediction
## high_use FALSE TRUE
##    FALSE   237   22
##    TRUE     66   45
ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ #plot of 'high_use' versus 'probability' in 'alc'
  geom_point()+
  ggtitle("Plotted confusion matrix results ")

The printout of the data frame shows the new columns: the predicted probabilities and the class predictions of high_use.

A confusion matrix visualizes and summarizes the performance of a classification algorithm:

  • true negatives: 237
  • true positives: 45
  • false positives (type I error): 22
  • false negatives (type II error): 66
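From these counts we can also derive the standard classification metrics. A small base-R sketch using the numbers above:

```r
# counts read from the confusion matrix above
tn <- 237; tp <- 45; fp <- 22; fn <- 66

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Note the low sensitivity: the model detects only about 41% of the high-use students, even though overall accuracy is about 76%.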

Next, let’s compute the proportion of incorrect predictions. The mean of incorrectly classified observations can be thought of as a penalty (loss) function for the classifier: the smaller the penalty, the better.

table(high_use = alc$high_use, prediction = alc$prediction) %>% # tabulate the target variable versus the predictions
  prop.table() %>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64054054 0.05945946 0.70000000
##    TRUE  0.17837838 0.12162162 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000
loss_func <- function(class, prob) { # define a loss function (mean prediction error)
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability) #compute the average number of wrong predictions in the (training) data
## [1] 0.2378378

This analysis revealed that the proportion of wrong predictions in the training data is ~24%.
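To see why this loss function equals the misclassification rate, here is a small toy check (the classes and probabilities are made up for illustration): since TRUE/FALSE are coerced to 1/0, `abs(class - prob) > 0.5` is TRUE exactly when thresholding the probability at 0.5 disagrees with the true class.

```r
loss_func <- function(class, prob) {  # mean prediction error, as defined above
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# toy data: two correct and two incorrect predictions
class <- c(TRUE, FALSE, TRUE, FALSE)
prob  <- c(0.9,  0.2,   0.3,  0.7)

loss_func(class, prob)        # 0.5
mean((prob > 0.5) != class)   # same value: share of misclassified points
```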

Now we continue to cross-validation of the model (leave-one-out and 10-fold).

# K-fold cross-validation
library(boot)
# K = nrow(alc) gives leave-one-out cross-validation
cv_train <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = nrow(alc))

# 10-fold cross-validation
cv_test <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = 10)

# average number of wrong predictions
cv_train$delta[1]
## [1] 0.2405405
cv_test$delta[1]
## [1] 0.2459459

Comparing the average prediction errors from the leave-one-out and 10-fold cross-validations, we see that the error is about the same (~24-25%). The error is slightly smaller than in the exercise set, meaning that the model with failures, absences and goout predicts high alcohol use a bit better than the model used in the exercise set.


Assignment 4: Clustering & Classification

Name: Henna Kallo

Date: 26.11.2023

In this exercise we learn data clustering and classification.

  • Data source: MASS package in R: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

  • Literature: Part IV of “MABS4IODS” (chapters 12, 17 & 18)

  • Important concepts:

    • Classification: organizing a large, complex set of multivariate data by sorting objects into a small number of homogeneous groups or clusters.
    • Multivariate data: several measurements or observations are made on each sampling unit and considered simultaneously to reveal patterns. There is no division into explanatory and dependent variables.
library(MASS); library(tidyr); library(corrplot); library(dplyr); library(ggplot2) #load libraries
rm(list=ls())
data("Boston") #load the data

Let’s take a look of the Boston data:

str(Boston) #structure of the dataset
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
  • The Boston dataset consists of multivariate data on housing values in suburbs of Boston. The data frame has 506 rows and 14 variables, all of numeric or integer type. Brief description of the variables:
    • crim: crime rates
    • zn: proportion of residential land zoned for lots over 25,000 sq. ft.
    • indus: proportion of other than retail business acres
    • chas: next to Charles River, or not
    • nox: pollution; nitrogen oxides
    • rm: average number of rooms per dwelling
    • age: proportion of owner-occupied units built before 1940
    • dis: distances to the employment centres (weighted mean)
    • rad: access to radial highways
    • tax: full-value property-tax rate
    • ptratio: student/teacher ratio
    • black: 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town.
    • lstat: lower status of the population (%)
    • medv: median value of owner-occupied homes in $1000s

With this data set we can explore connections between economical, environmental, societal and cultural features.

The descriptions of the variables are listed in here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.

Let’s explore the summary statistics of the multivariate Boston data:

– each of the variables separately

– relationships between the variables (correlations)

summary(Boston) #summary of each variable separately
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  • Some interesting points:
    • The mean per capita crime rate is 3.61. Variation is very large: the minimum is close to 0, whereas the maximum is close to 90.
    • Nitrogen oxides concentrations vary from 0.39 to 0.87 parts per 10 million.
    • The average number of rooms per dwelling is about 6, but it varies from 3.5 to almost 9.
    • On average, almost 70% of owner-occupied units were built before 1940. However, there is a lot of variability.
    • The full-value property-tax rate varies from 187 to 711 (per $10,000).
    • There are on average about 19 pupils per teacher.
    • The ‘black’ variable varies a lot, but its value is relatively high in most towns.
    • The percentage of lower status population varies from 1.73% to 37.97%, with a mean of about 12.7%.
    • The mean of the median values of owner-occupied homes is $22,530.

The scatterplots of each variable pair can be used as a first indicative visualization of the associations between the variables. In addition, we print an illustrative correlation matrix visualization, which shows how strong the association between each variable pair is, and whether the association is positive or negative.

pairs(Boston,
      col = "cornflowerblue",
      main = "Scatterplots for each variable-combination of Boston data frame")

cor_matrix <- cor(Boston) %>% #correlation matrix
  round(digits=2)

corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6) ##visualize the correlations

The crime rate seems to be slightly positively associated with the proportion of owner-occupied units built prior to 1940 and with the percentage of lower status population. Also, as accessibility to radial highways improves and the property-tax rate increases, the per capita crime rate increases.

Not surprisingly, there is an association between the proportion of industry and nitrogen oxides levels. Moreover, a higher nitrogen oxides concentration seems to be associated with a higher proportion of owner-occupied units built prior to 1940 and with a higher percentage of lower status population. The scatterplots indicate that the air pollution and industry variables are in turn negatively associated with the distance to the Boston employment centres and with the median value of owner-occupied homes.

A lower number of rooms per dwelling seems to be linked with a higher percentage of lower status population. As expected, the average number of rooms is positively associated with the median value of owner-occupied homes. A high percentage of lower status population is clearly linked with a reduced median value of owner-occupied homes.

Moreover, the proportion of owner-occupied units built prior to 1940 is negatively associated with the distance to employment centres, and there is a positive correlation between accessibility to radial highways and the full-value property-tax rate per $10,000.

The variables are on very different scales. We will make them more comparable by standardizing the dataset.

boston_scaled <- scale(Boston) # center and standardize variables

summary(boston_scaled) # summaries of the scaled variables
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

When scaling the data, we subtract the column mean from each column and divide the difference by the column’s standard deviation. This is why the mean of every scaled variable is 0. After scaling (centering and standardizing) we can better compare the variables with each other. This website briefly discusses the why and when of centering and scaling variables: https://www.goldsteinepi.com/blog/thewhyandwhenofcenteringcontinuouspredictorsinregressionmodeling/index.html
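As a small sanity check on synthetic data, `scale()` does exactly the same as the by-hand centering and standardization described above:

```r
x <- c(2, 4, 6, 8, 10)  # toy variable

# subtract the mean, divide by the standard deviation
by_hand   <- (x - mean(x)) / sd(x)
via_scale <- as.numeric(scale(x))  # drop the matrix attributes

all.equal(by_hand, via_scale)                          # TRUE
round(c(mean = mean(by_hand), sd = sd(by_hand)), 10)   # mean 0, sd 1
```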

Classification

Linear discriminant analysis

“A further question that is often of interest for grouped multivariate data is whether or not it is possible to use the measurements made to construct a classification rule derived from the original observations (the training set) that will allow new individuals having the same set of measurements (the test sample) to be allocated to their correct groups.” -Part IV of “MABS4IODS” -> discriminant function analysis (chapter 18)

Now we will perform linear discriminant analysis: the idea is to find a linear combination of features that characterizes or separates two or more classes of objects or events. When we want to use a statistical method to predict something, it is important to have data for testing how well the predictions fit. Splitting the original data into train and test sets allows us to check how well our model works.

Our target variable is the per capita crime rate by town; the rest of the variables are predictors. Our interest lies in deriving a classification rule that uses measurements of the predictor variables to identify an observation’s category of the crim variable.

We start by converting the crim variable to a categorical variable with four categories based on quantiles: low, med_low, med_high and high crime rate per capita.

class(boston_scaled) # class of the boston_scaled object
## [1] "matrix" "array"
boston_scaled<-as.data.frame(boston_scaled) # change the object to data frame

#Create a factor variable from numerical: binning by quantiles; variable `crim` (per capita crime rate by town)

summary(boston_scaled$crim) #summary
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins<-quantile(boston_scaled$crim) #save quantiles, bin limits, in 'bins'
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Next we’ll divide the data into a train set (80% of the data) and a test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows (call set.seed() first for a reproducible split)
ind <- sample(n, size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data; save for later to calculate how well the model performed in prediction
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

And finally perform the analysis and plot the results:

set.seed(123)
lda.fit <- lda(crime~., data = train)

lda.fit # print the lda.fit object
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2400990 0.2475248 0.2599010 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       1.02691124 -0.8925703 -0.079333958 -0.9001291  0.46186273 -0.9227692
## med_low  -0.06288915 -0.2401781  0.011791568 -0.5381983 -0.14024936 -0.2851063
## med_high -0.38261997  0.1663357  0.278864965  0.3712962  0.09714482  0.4430410
## high     -0.48724019  1.0170492 -0.009855719  1.0363160 -0.35876221  0.8131380
##                 dis        rad        tax    ptratio       black        lstat
## low       0.9064412 -0.6947544 -0.7434325 -0.4223465  0.37795855 -0.767231531
## med_low   0.3383899 -0.5473481 -0.4451896 -0.0294608  0.34690459 -0.109555364
## med_high -0.3614090 -0.4524279 -0.3438243 -0.2598373  0.08615784  0.008337242
## high     -0.8556434  1.6388211  1.5145512  0.7815834 -0.81439967  0.924683601
##                 medv
## low       0.52419927
## med_low  -0.01118933
## med_high  0.16409444
## high     -0.68887591
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11190869  0.70503617 -0.89220509
## indus    0.05692257 -0.03176740  0.30292827
## chas    -0.10205539 -0.07611978  0.06517590
## nox      0.36947641 -0.71672595 -1.39968094
## rm      -0.15755420 -0.03806161 -0.18356450
## age      0.19807599 -0.37258245 -0.14949317
## dis     -0.03159638 -0.17851946  0.08271081
## rad      3.68776285  1.02341687 -0.17504321
## tax      0.03714716 -0.15864637  0.71011407
## ptratio  0.12597155 -0.01172305 -0.25822042
## black   -0.10478341  0.05472561  0.20858029
## lstat    0.19629225 -0.19460411  0.29784999
## medv     0.20175163 -0.35152917 -0.17941522
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9597 0.0312 0.0091
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
                   x1 = myscale * heads[,choices[1]], 
                   y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.2)

  • The biplot:
    • variables are shown as arrows from the origin and observations as points. The configuration of the arrows reflects the relations of the variables: the cosine of the angle between two arrows reflects the correlation between the variables they represent, and the cosine of the angle between an arrow and an axis indicates how strongly that variable contributes to the corresponding linear discriminant.
    • The most influential linear separators for the classes are zn, rad and nox.

Next, we use predict() to classify the LDA-transformed testing data. Finally, we calculate the accuracy of the LDA model by comparing the predicted classes with the true classes.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      11        0    0
##   med_low    4      21        4    0
##   med_high   0       6       17    3
##   high       0       0        0   22

Let’s calculate the accuracy of the prediction: (14+21+17+22)/102 * 100% = ~73% of the predictions are correct (~27% of observations are misclassified).
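The same accuracy can be computed programmatically from the cross-tabulation: the correct predictions sit on the diagonal. A small sketch with the counts copied from the output above:

```r
# cross-tabulation copied from the output above (rows = correct, cols = predicted)
conf <- matrix(c(14, 11,  0,  0,
                  4, 21,  4,  0,
                  0,  6, 17,  3,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)  # correct predictions / all predictions
round(accuracy, 3)
```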

Clustering

Cluster analysis is a generic term for a wide range of numerical methods with the common goal of uncovering or discovering groups or clusters of observations that are homogeneous and separated from other groups. Clusters are identified by the assessment of the relative distances between points. Clustering means that some points (or observations) of the data are in some sense closer to each other than some other points.

Clustering starts with calculating an inter-individual distance or similarity matrix. There are many ways to calculate distances or similarities between pairs of individuals; here we use the Euclidean distance. After calculating the distances, we proceed to run the k-means algorithm.

K-means clustering is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects.

data("Boston") #load the data again

boston_scaled_2 <- scale(Boston) # scaling variables
boston_scaled_2<-as.data.frame(boston_scaled_2) # change the object to data frame

dist_eu <- dist(boston_scaled_2) # euclidean distance matrix

summary(dist_eu) # look at the summary of the distances
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled_2, centers = 3) #centers = number of clusters

# plot the Boston dataset with clusters
pairs(boston_scaled_2, col = km$cluster) 

The summary table of the Euclidean distances shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of the pairwise distances between observations.
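To illustrate what `dist()` computes, here is a tiny check (with made-up observations) that the Euclidean distance matches the by-hand formula, i.e. the square root of the summed squared differences:

```r
m <- rbind(a = c(1, 2, 3),   # two toy observations with three variables
           b = c(4, 6, 3))

by_hand  <- sqrt(sum((m["a", ] - m["b", ])^2))  # sqrt(3^2 + 4^2 + 0^2)
via_dist <- as.numeric(dist(m))                 # Euclidean is dist()'s default

c(by_hand = by_hand, via_dist = via_dist)  # both 5
```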

Let’s determine the optimal number of clusters.

When plotting the number of clusters against the total within-cluster sum of squares (WCSS), the optimal number of clusters is where the total WCSS drops radically.

#K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot above represents the variance within the clusters. It decreases as k increases, but a bend can be seen at k = 6. This bend indicates that additional clusters beyond the sixth add little value. In the next section, we’ll classify the observations into 6 clusters. (more info: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/)

km <- kmeans(boston_scaled_2, centers = 6) # k-means clustering

print(km)
## K-means clustering with 6 clusters of sizes 49, 47, 34, 78, 124, 174
## 
## Cluster means:
##         crim         zn      indus       chas        nox         rm         age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554  1.5399833  0.07933756
## 2 -0.2834643 -0.4872402  1.5965043 -0.2723291  1.0453673 -0.6245590  0.94375129
## 3 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.37213224
## 4 -0.4126954  1.9952361 -1.1032525 -0.2218534 -1.1552982  0.6080657 -1.40363754
## 5  1.1156495 -0.4872402  1.0149946 -0.2723291  0.9916473 -0.4276828  0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
##          dis          rad         tax    ptratio       black      lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738  0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809  0.03772813 -0.2084479 -0.08717824  0.7185475
## 3 -0.4033382  0.001081444 -0.09756330 -0.3924585  0.17154271 -0.1643525
## 4  1.5772490 -0.627024335 -0.57588304 -0.7161406  0.35335146 -0.9028481
## 5 -0.8170870  1.659602895  1.52941294  0.8057784 -0.81154619  0.9129958
## 6  0.2769540 -0.587168164 -0.59021294  0.1702603  0.30993538 -0.1365045
##         medv
## 1  1.6147330
## 2 -0.5152915
## 3  0.5733409
## 4  0.6621944
## 5 -0.7713403
## 6 -0.1747229
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   6   1   1   1   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   4 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   4   6   6   6   6   6   6   6   6   6   6   6   4   4   4   4   4   4   4   6 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   6   6   6   4   4   4   4   6   6   6   6   6   6   6   6   6   6   6   6   6 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   4   6   6   6   6   6   6   6   1   1   6   6   6   6   6   6   6   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   2   3   2   2   2   2   2   2   2   2   2   3   2   3   3   2   1   2   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   3   1   3   3   2   2   1   2   2   2   2   2   6   6   6   1   6   6   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   1   1   1   6   6   1   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   4   4   4   4   4   6   6   6   3   3   3   3   3   6   6   6   3   1   3   3 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   3   3   3   1   1   1   1   1   1   1   6   1   1   1   3   6   3   1   4   4 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   4   6   4   4   6   6   4   6   6   4   4   4   4   4   4   4   4   1   1   1 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   1   1   1   1   1   6   1   1   1   3   6   6   6   3   3   4   3   3   4   1 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   1   1   3   4   4   4   4   4   4   4   4   4   4   6   6   6   6   6   4   4 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   4   4   4   4   1   6   1   1   6   6   6   6   6   6   6   6   6   6   6   6 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   6   6   6   6   6   6   6   6   6   6   6   4   4   6   6   6   6   6   6   6 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   6   4   6   4   4   6   6   4   4   4   4   4   4   4   4   4   3   3   3   5 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   5   5   5   3   3   5   5   5   5   3   3   5   3   5   5   5   5   5   5   5 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   5   5   5   5   5   5   5   5   2   2   2   2   2   6   6   6   6   6   6   6 
## 501 502 503 504 505 506 
##   6   6   6   6   6   6 
## 
## Within cluster sum of squares by cluster:
## [1] 209.5683 263.8282 340.7321 363.4559 984.9444 590.5735
##  (between_SS / total_SS =  61.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
  • The printed k-means output displays:
    • the sizes of the clusters
    • the cluster means or centers: a matrix whose rows are the cluster numbers (1 to 6) and whose columns are the variables
    • the clustering vector: a vector of integers (from 1:k) indicating the cluster to which each observation is allocated

Let’s visualize the clusters with the pairs() function:

pairs(boston_scaled_2, col = km$cluster)

  • Here are some observations from the clustering visualization:
    • better accessibility to radial highways and a higher property-tax rate seem to cluster together with a high crime rate
    • a lower median value of owner-occupied homes is classified into the same clusters as increasing crime rates
    • the industry and nitrogen oxide variables form at least 3 clear clusters together
    • a low status of the population is clustered with a reduced median value of owner-occupied homes
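These visual impressions can be checked numerically by computing per-cluster means of the scaled variables. A sketch that rebuilds the scaled data and clustering (note: `km6` and the seed are assumptions here, so the cluster labels may not match the earlier printout):

```r
library(MASS)  # provides the Boston data

boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)  # a different seed gives differently numbered clusters
km6 <- kmeans(boston_scaled, centers = 6)

# per-cluster means of the variables mentioned in the observations above
cluster_means <- aggregate(
  boston_scaled[, c("crim", "rad", "tax", "medv", "lstat")],
  by = list(cluster = km6$cluster), FUN = mean)
round(cluster_means, 2)
```

Clusters where `crim`, `rad` and `tax` are all above average, and `medv` below, correspond to the pattern seen in the pairs plot.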
Bonus task:
data("Boston")
boston_scaled_bonus <- scale(Boston)                      # scaling
boston_scaled_bonus <- as.data.frame(boston_scaled_bonus) # convert to a data frame

# the k-means clusters (km$cluster) serve as the target classes

# LDA with the cluster labels as the response
set.seed(123)
lda.fit_bonus <- lda(km$cluster~., data = boston_scaled_bonus)

lda.fit_bonus # print the lda.fit object
## Call:
## lda(km$cluster ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##          1          2          3          4          5          6 
## 0.09683794 0.09288538 0.06719368 0.15415020 0.24505929 0.34387352 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm         age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554  1.5399833  0.07933756
## 2 -0.2834643 -0.4872402  1.5965043 -0.2723291  1.0453673 -0.6245590  0.94375129
## 3 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.37213224
## 4 -0.4126954  1.9952361 -1.1032525 -0.2218534 -1.1552982  0.6080657 -1.40363754
## 5  1.1156495 -0.4872402  1.0149946 -0.2723291  0.9916473 -0.4276828  0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
##          dis          rad         tax    ptratio       black      lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738  0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809  0.03772813 -0.2084479 -0.08717824  0.7185475
## 3 -0.4033382  0.001081444 -0.09756330 -0.3924585  0.17154271 -0.1643525
## 4  1.5772490 -0.627024335 -0.57588304 -0.7161406  0.35335146 -0.9028481
## 5 -0.8170870  1.659602895  1.52941294  0.8057784 -0.81154619  0.9129958
## 6  0.2769540 -0.587168164 -0.59021294  0.1702603  0.30993538 -0.1365045
##         medv
## 1  1.6147330
## 2 -0.5152915
## 3  0.5733409
## 4  0.6621944
## 5 -0.7713403
## 6 -0.1747229
## 
## Coefficients of linear discriminants:
##                   LD1         LD2          LD3         LD4         LD5
## crim    -0.0305347753 -0.08591377 -0.108258758  0.01370669  0.12793637
## zn      -0.2349745638 -0.08924451 -0.895982678 -1.47547778  0.34969495
## indus    0.1256678211 -0.64511317  1.388877801 -1.77654417  0.39972164
## chas     5.8325152872  0.24741612 -0.311207345 -0.07443844 -0.27828298
## nox     -0.0025874888  0.27316108  0.279998181 -0.33251185  0.43887081
## rm      -0.0806099076 -0.01151686 -0.148388969  0.12537565  0.56954956
## age      0.0721570288 -0.01868907  0.362844031  0.22642315  0.24959116
## dis      0.0698432704  0.27583463 -0.218044854 -0.54697937 -0.14293536
## rad      0.2608126124 -3.31540916 -1.507829403  1.14007512  0.04287448
## tax     -0.0049952119  0.23295413 -0.381672532 -0.48815038 -0.21073890
## ptratio -0.0007726087  0.08817263  0.041915486 -0.04862072 -0.32527265
## black    0.0078197790  0.09331354 -0.003179723  0.02328372 -0.03306096
## lstat   -0.1245078745 -0.13037037 -0.008198020 -0.03449407  0.25315537
## medv    -0.1868981610  0.34078428  0.026517982  0.20032076  0.88234880
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5 
## 0.6255 0.2484 0.0727 0.0387 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
                   x1 = myscale * heads[,choices[1]], 
                   y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes_bonus <- as.numeric(km$cluster)

# plot the lda results
plot(lda.fit_bonus, dimen = 2, col=classes_bonus, pch = classes_bonus)
lda.arrows(lda.fit_bonus, myscale = 1.2)

Judging by the absolute sizes of the coefficients on the first two discriminants, the most influential linear separators for the clusters are chas, rad & indus.
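This claim can be backed by ranking the variables by the absolute size of their LD1 and LD2 coefficients. A self-contained sketch (the object names `bs`, `km6` and `fit`, and the seed, are assumptions; with a different k-means seed the labels, and hence the exact coefficients, may differ from the printout above):

```r
library(MASS)  # Boston data and lda()

bs <- as.data.frame(scale(Boston))
set.seed(123)
km6 <- kmeans(bs, centers = 6)
fit <- lda(factor(km6$cluster) ~ ., data = bs)

# rank variables by the absolute size of their LD1 and LD2 coefficients
head(sort(abs(coef(fit)[, "LD1"]), decreasing = TRUE), 3)
head(sort(abs(coef(fit)[, "LD2"]), decreasing = TRUE), 3)
```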

Super-Bonus:
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                z = matrix_product$LD3, type = 'scatter3d',
                mode = 'markers', color = train$crime)

The 3D plot helps to separate clusters that overlap when projected onto only two discriminant axes.

###FIN###